ABSTRACT

Motivation: Recent advances in brain imaging and high-throughput 
genotyping techniques enable new approaches to study the 
influence of genetic and anatomical variations on brain functions 
and disorders. Traditional association studies typically perform 
independent and pairwise analysis among neuroimaging measures, 
cognitive scores and disease status, and ignore the important 
underlying interacting relationships between these units. 
Results: To overcome this limitation, in this article, we propose 
a new sparse multimodal multitask learning method to reveal 
complex relationships from gene to brain to symptom. Our main 
contributions are three-fold: (i) introducing combined structured 
sparsity regularizations into multimodal multitask learning to integrate 
multidimensional heterogeneous imaging genetics data and identify 
multimodal biomarkers; (ii) utilizing a joint classification and 
regression learning model to identify disease-sensitive and cognition- 
relevant biomarkers; (iii) deriving a new efficient optimization 
algorithm to solve our non-smooth objective function and providing 
rigorous theoretical analysis of its convergence to the global optimum.
Using the imaging genetics data from the Alzheimer's Disease 
Neuroimaging Initiative database, the effectiveness of the proposed 
method is demonstrated by clearly improved performance on 
predicting both cognitive scores and disease status. The identified 
multimodal biomarkers could predict not only disease status but also 
cognitive function to help elucidate the biological pathway from gene 
to brain structure and function, and to cognition and disease. 
Availability: Software is publicly available at:
http://ranger.uta.edu/%7eheng/multimodal/
Contact: heng@uta.edu; shenli@iupui.edu 



1 INTRODUCTION 

Recent advances in acquiring multimodal brain imaging and
genome-wide array data provide exciting new opportunities to study
the influence of genetic variation on brain structure and function.
Research in this emerging field, known as imaging genetics, holds
great promise for a systems biology of the brain to better understand
complex neurobiological systems, from genetic determinants to
cellular processes to the complex interplay of brain structure,
function, behavior and cognition. Analysis of these multimodal
datasets will facilitate early diagnosis, deepen mechanistic
understanding and improve treatment of brain disorders.

Machine learning methods have been widely employed to predict
Alzheimer's disease (AD) status using imaging genetics measures
(Batmanghelich et al., 2009; Fan et al., 2008; Hinrichs et al.,
2009b; Shen et al., 2010a). Since AD is a neurodegenerative
disorder characterized by progressive impairment of memory
and other cognitive functions, regression models have also been
investigated to predict clinical scores from structural, such as
magnetic resonance imaging (MRI), and/or molecular, such as
fluorodeoxyglucose positron emission tomography (FDG-PET),
neuroimaging data (Stonnington et al., 2010; Walhovd et al., 2010).
For example, Walhovd et al. (2010) performed stepwise regression
in a pairwise fashion to relate each of the MRI and FDG-PET measures
of eight candidate regions to each of four Rey's Auditory Verbal
Learning Test (RAVLT) memory scores. This univariate approach,
however, did not consider either interrelated structures within
imaging data or those within cognitive data. Using relevance vector
regression, Stonnington et al. (2010) jointly analyzed the voxel-based
morphometry (VBM) features extracted from the entire brain
to predict each selected clinical score, but the different clinical
scores were still investigated independently of each other.

One goal of imaging genetics is to identify genetic risk factors 
and/or imaging biomarkers via intermediate quantitative traits (QTs, 
e.g. cognitive memory scores used in this article) on the chain 
from gene to brain to symptom. Thus, both disease classification 
and QT prediction are important machine learning tasks. Prior 
imaging genetics research typically employs a two-step procedure 
for identifying risk factors and biomarkers: one first determines 
disease-relevant QTs, and then detects the biomarkers associated 
with these QTs. Since a QT could be related to many genetic or 
imaging markers on different pathways that are not all disease 
specific (e.g. QT 2 and Gene 3 in Fig. 1), an ideal scenario would
be to discover only those markers associated with both QT and 
disease status for a better understanding of the underlying biological 
pathway specific to the disease. 

On the other hand, identifying genetic and phenotypic biomarkers
from large-scale multidimensional heterogeneous data is an
important biomedical and biological research topic. Unlike
simple feature selection working on a single data source,
multimodal learning describes the setting of learning from
data where observations are represented by multiple types of
feature sets. Many multimodal methods have been developed
for classification and clustering purposes, such as co-training
(Abney, 2002; Brefeld and Scheffer, 2004; Ghani, 2002; Nigam
et al., 2000) and multiview clustering (Bickel and Scheffer, 2004;
Dhillon et al., 2003). However, they typically assume that the
multimodal feature sets are conditionally independent, which does
not hold in many real-world applications such as imaging genetics.
Considering that different representations give rise to different
kernel functions, several Multiple Kernel Learning (MKL) approaches
(Bach et al., 2004; Hinrichs et al., 2009a; Kloft et al., 2008;
Lanckriet et al., 2004; Rakotomamonjy et al., 2007; Sonnenburg
et al., 2006; Suykens et al., 2002; Ye et al., 2008; Yu et al., 2010;
Zien and Ong, 2007) have been recently studied and employed to
integrate heterogeneous data and select multitype features. However,
such models train a single weight for all features from the same
modality, i.e. all features from the same data source are weighted
equally when they are combined with the features from other sources.
This limitation often yields inadequate performance.

Fig. 1. A simplified schematic example of two pathways from gene to QTs
to phenotypic endpoints: the red one is disease relevant while the blue one
yields only normal variation. Traditional two-stage imaging genetic strategy
identifies QT 1 and QT 2 first and then Genes 1, 2, 3. Our new method will
identify only disease relevant genes (i.e. Gene 1 and Gene 2); and Gene 3
would not be identified because it cannot be used to classify disease status

To address the above challenges, we propose a new 
sparse multimodal multitask learning algorithm that integrates 
heterogeneous genetic and phenotypic data effectively and 
efficiently to identify disease-sensitive and cognition-relevant
biomarkers from multiple data sources. Different from LASSO
(Tibshirani, 1996), group LASSO (Yuan and Lin, 2006) and other
related methods that mainly find the biomarkers correlated to each 
individual QT (memory score), we consider predicting each memory 
score as a regression task and select biomarkers that tend to play an 
important role in influencing multiple tasks. A joint classification 
and regression multitask learning model is utilized to select the 
biomarkers correlated to memory scores and disease categories 
simultaneously. 



Sparsity regularizations have been widely investigated and applied
to multitask learning models by imposing non-smooth norms as
regularizers in the optimization problems. From the view of sparsity
organization, we have two types: (i) The flat sparsity is often
achieved by the $\ell_0$-norm or $\ell_1$-norm regularizer, or by the
trace norm in matrix/tensor completion. Optimization techniques
include LARS (Efron et al., 2004), linear gradient search (Liu
et al., 2009) and proximal methods (Beck and Teboulle, 2009). (ii)
The structured sparsity is usually obtained through different sparse
regularizers such as the $\ell_{2,1}$-norm (Kim and Xing, 2010;
Obozinski et al., 2010; Sun et al.), the $\ell_{2,0}$-norm (Luo
et al., 2010), the $\ell_{\infty,1}$-norm (also denoted as
$\ell_{1,2}$-norm, $\ell_{1,\infty}$-norm in different papers) and the
group $\ell_1$-norm (Yuan and Lin, 2006), which can be solved by the
methods in Micchelli et al. (2010) and Argyriou et al. (2008). We
propose a new combined structured sparse regularization to integrate
features from different modalities and to learn a weight for each
feature, leading to a more flexible scheme for feature selection in
data integration, which is illustrated in Figure 3. In our combined
structured sparse regularization, the group $\ell_1$-norm
regularization (blue circles in Fig. 3) learns the feature global
importance, i.e. the modal-wise feature importance of every data
modality on each class (task), and the $\ell_{2,1}$-norm
regularization (red circles in Fig. 3) explores the feature local
importance, i.e. the importance of each feature for multiple
classes/tasks. The proposed method is applied to identify
AD-sensitive biomarkers associated with the cognitive scores by
integrating heterogeneous genetic and phenotypic data (as shown in
Fig. 2). Our empirical results yield clearly improved performance on
predicting both cognitive scores and disease status.

Fig. 2. The proposed sparse multimodal multitask feature selection method
will identify biomarkers from multimodal heterogeneous data resources. The
identified biomarkers could predict not only disease status, but also cognitive
functions to help researchers better understand the underlying mechanism
from gene to brain structure and function, and to cognition and disease



2 IDENTIFYING DISEASE SENSITIVE AND 
QT-RELEVANT BIOMARKERS FROM 
HETEROGENEOUS IMAGING GENETICS DATA 

Pairwise univariate correlation analysis can quickly provide 
important association information between genetic/phenotypic data 
and QTs. However, it treats the features and the QTs as
independent and isolated units, so the underlying interacting
relationships between the units might be lost. We propose a new
sparse multimodal multitask learning model to reveal genetic and
phenotypic biomarkers, which are disease sensitive and QT-relevant,
by simultaneously and systematically taking into account an
ensemble of single nucleotide polymorphisms (SNPs) and phenotypic
signatures and jointly performing two heterogeneous tasks, i.e.
biomarker-to-QT regression and biomarker-to-disease classification.
The QTs studied in this article are the cognitive scores.

In multitask learning, given a set of input variables (i.e. features 
such as SNPs and MRI/PET measures), we are interested in learning 
a set of related models (e.g. relations between genetic/imaging 
markers and cognitive scores) to predict multiple outcomes (i.e. 
tasks such as predicting cognitive scores and disease status). Because
these tasks are related, they share a common input space. As a
result, it is desirable to learn all the models jointly rather than
treating each task as independent and fitting each model separately,
as is done by Lasso (Tibshirani, 1996) and group Lasso (Yuan and Lin,
2006). Such multitask learning can discover robust patterns (because
significant patterns in a single task could be outliers for other tasks)
and potentially increase the predictive power.

In this article, we write matrices as uppercase letters and vectors
as boldface lowercase letters. Given a matrix $W = [w_{ij}]$, its
$i$-th row and $j$-th column are denoted as $\mathbf{w}^i$ and
$\mathbf{w}_j$, respectively. The $\ell_{2,1}$-norm of the matrix $W$
is defined as $\|W\|_{2,1} = \sum_{i} \|\mathbf{w}^i\|_2$, where the
sum runs over all rows of $W$ (also denoted as $\ell_{1,2}$-norm by
other researchers).
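The definition of the $\ell_{2,1}$-norm can be checked numerically. The following is a minimal NumPy sketch; the function name `l21_norm` and the toy matrix are illustrative assumptions, not part of the released software.

```python
import numpy as np

def l21_norm(W):
    """Sum of the Euclidean norms of the rows of W (the l2,1-norm)."""
    return float(np.sum(np.linalg.norm(W, axis=1)))

# Toy matrix (hypothetical): the rows have Euclidean norms 5, 0 and 13.
W = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [5.0, 12.0]])
print(l21_norm(W))  # 18.0
```

A row that is entirely zero contributes nothing to the norm, which is why penalizing this quantity tends to switch off whole rows, i.e. features, across all tasks at once.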

2.1 Heterogeneous data integration via combined 
structured sparse regularizations 

First, we will systematically propose our new multimodal learning
method to integrate and select the genetic and phenotypic biomarkers
from large-scale heterogeneous data. In the supervised learning
setting, we are given $n$ training samples
$\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{n}$, where
$\mathbf{x}_i = (\mathbf{x}_i^1, \cdots, \mathbf{x}_i^k)^T \in \mathbb{R}^d$
is the input vector including all features from a total of $k$
different modalities and each modality $j$ has $d_j$ features
($d = \sum_{j=1}^{k} d_j$). $\mathbf{y}_i \in \mathbb{R}^c$ is the
class label vector of data point $\mathbf{x}_i$ (only one element in
$\mathbf{y}_i$ is 1, and others are zeros), where $c$ is the number
of classes (tasks). Let
$X = [\mathbf{x}_1, \cdots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$ and
$Y = [\mathbf{y}_1, \cdots, \mathbf{y}_n] \in \mathbb{R}^{c \times n}$.
Different from MKL, we directly learn a $d \times c$ parameter matrix
as:

$$
W = \begin{bmatrix}
\mathbf{w}_1^1 & \cdots & \mathbf{w}_c^1 \\
\vdots & \ddots & \vdots \\
\mathbf{w}_1^k & \cdots & \mathbf{w}_c^k
\end{bmatrix} \in \mathbb{R}^{d \times c}, \tag{1}
$$

where $\mathbf{w}_p^q \in \mathbb{R}^{d_q}$ indicates the weights of
all features in the $q$-th modality with respect to the $p$-th task
(class). Typically, we can use a convex loss function
$\mathcal{L}(X, W)$ to measure the loss incurred by $W$ on the
training samples. Compared with MKL approaches that learn
one weight for one kernel matrix representing one modality, our
method will learn the weight for each feature to capture the local
feature importance. Since the features come from heterogeneous
data sources, we impose the regularizer $\mathcal{R}(W)$ to capture
the interrelationships of modalities and features as:



$$
\min_{W} \; \mathcal{L}(X, W) + \gamma \mathcal{R}(W), \tag{2}
$$

where $\gamma$ is a trade-off parameter. In heterogeneous data fusion,
from the multiview perspective, the features of a specific view
(modality) can be more or less discriminative for different tasks
(classes). Thus, we propose a new group $\ell_1$-norm ($G_1$-norm) as
a regularization term in Equation (2), which is defined over $W$ as
follows:



$$
\|W\|_{G_1} = \sum_{i=1}^{c} \sum_{j=1}^{k} \|\mathbf{w}_i^j\|_2, \tag{3}
$$

which is illustrated by the blue circles in Figure 3. Then
Equation (2) becomes:



$$
\min_{W} \; \mathcal{L}(X, W) + \gamma_1 \|W\|_{G_1}. \tag{4}
$$

Since the group $\ell_1$-norm uses the $\ell_2$-norm within each
modality and the $\ell_1$-norm between modalities, it enforces
sparsity between different modalities, i.e. if the features of one
modality are not discriminative for certain tasks, the objective in
Equation (4) will assign zeros (in the ideal case; usually they are
very small values) to them for the corresponding tasks; otherwise,
their weights are large. This new group $\ell_1$-norm regularizer
captures the global relationships between data modalities.
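To make the block structure of the $G_1$-norm concrete, the sketch below evaluates Equation (3) for a small weight matrix. The helper name `group_l1_norm`, the argument `group_sizes` (the list of $d_j$) and the toy values are assumptions made for illustration only.

```python
import numpy as np

def group_l1_norm(W, group_sizes):
    """G1-norm of W: for every task (column) and every modality
    (contiguous block of rows), take the Euclidean norm of the block
    and sum all of these norms."""
    if sum(group_sizes) != W.shape[0]:
        raise ValueError("group sizes must add up to the number of rows")
    bounds = np.cumsum([0] + list(group_sizes))
    total = 0.0
    for start, stop in zip(bounds[:-1], bounds[1:]):
        # norm over the rows of this modality, separately for each task
        total += np.linalg.norm(W[start:stop, :], axis=0).sum()
    return float(total)

# Two hypothetical modalities with 2 and 3 features, and two tasks.
W = np.array([[3.0, 0.0],
              [4.0, 0.0],
              [0.0, 1.0],
              [0.0, 2.0],
              [0.0, 2.0]])
print(group_l1_norm(W, [2, 3]))  # 5 + 0 + 0 + 3 = 8.0
```

In this toy matrix the first modality is used only by the first task and the second modality only by the second task, which is the modality-level sparsity pattern the regularizer encourages.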



Fig. 3. Illustration of the feature weight matrix $W^T$. The elements in the
matrix with deep blue color have large values. The group $\ell_1$-norm
($G_1$-norm) emphasizes the learning of the group-wise weights for a type of
features (e.g. all the SNP features, or all the MRI imaging features, or all
the FDG-PET imaging features) corresponding to each task (e.g. the prediction
of a disease status or a memory score) and the $\ell_{2,1}$-norm accentuates
the individual weight learning across multiple tasks

However, in certain cases, even if most features in one modality
are not discriminative for the classification or regression tasks, a
small number of features in the same modality can still be highly
discriminative. From the multitask learning point of view, such
important features should be shared by all/most tasks. Thus, we
add an additional $\ell_{2,1}$-norm regularizer into Equation (4) as:

$$
\min_{W} \; \mathcal{L}(X, W) + \gamma_1 \|W\|_{G_1} + \gamma_2 \|W\|_{2,1}. \tag{5}
$$

The $\ell_{2,1}$-norm has been popularly used in multitask feature
selection (Argyriou et al., 2008; Obozinski et al., 2010). Since the
$\ell_{2,1}$-norm regularizer imposes sparsity between all features
and non-sparsity between tasks, the features that are discriminative
for all tasks will get large weights.

Our regularization terms consider the heterogeneous features from
both group-wise and individual viewpoints. Figure 3 visualizes the
matrix $W^T$ as a demonstration. In Figure 3, the elements with deep
blue color have large values. The group $\ell_1$-norm emphasizes the
group-wise weights learning corresponding to each task and the
$\ell_{2,1}$-norm accentuates the individual weight learning across
multiple tasks. Through the combined regularizations, for each task
(class), many features (not all of them) in the discriminative
modalities and a small number of features (not necessarily none) in
the non-discriminative modalities will learn large weights as the
important and discriminative features.

Multidimensional data integration has become increasingly
important to many biological and biomedical studies. So far, the
MKL methods are most widely used. Because they learn a single weight
per modality, the MKL methods cannot explore both modality-wise
importance and individual importance of features simultaneously.
Our new structured sparse multimodal learning method integrates
the multidimensional data in a more efficient and effective way. The
loss function $\mathcal{L}(X, W)$ in Equation (5) can be replaced by
either the least squares loss function or the logistic regression loss
function to perform regression/classification tasks.

2.2 Joint disease classification and QT regression 

Since we are interested in identifying the disease-sensitive and
QT-relevant biomarkers, we consider performing both logistic
regression for classifying disease status and multivariate regression
for predicting cognitive memory scores simultaneously (Wang
et al., 2011). A similar model was used in Yang et al. (2009) for
heterogeneous multitask learning. Regular multitask learning only
considers homogeneous tasks such as regression or classification
individually. Joint classification and regression can be regarded as a
learning paradigm for handling heterogeneous tasks.

First, logistic regression is used for disease classification, which
minimizes the following loss function:

$$
\mathcal{L}_1(W) = \sum_{i=1}^{n} \sum_{k=1}^{c_1} \left( y_{ik} \log \sum_{l=1}^{c_1} e^{\mathbf{w}_l^T \mathbf{x}_i} - y_{ik} \mathbf{w}_k^T \mathbf{x}_i \right). \tag{6}
$$

Here, we perform three binary classification tasks for the following
three diagnostic groups respectively ($c_1 = 3$): AD, mild cognitive
impairment (MCI), and healthy control (HC).

Second, we use the traditional multivariate least squares
regression model to predict memory scores. Under the regression
matrix $P \in \mathbb{R}^{d \times c_2}$, the least squares loss is
defined by

$$
\mathcal{L}_2(P) = \|X^T P - Z\|_F^2, \tag{7}
$$

where $X$ is the data matrix, $P$ is the regression coefficient
matrix for $c_2$ tasks, and the label matrix is
$Z = [(\mathbf{z}^1)^T, \cdots, (\mathbf{z}^n)^T]^T \in \mathbb{R}^{n \times c_2}$.

Performing the classification and regression tasks jointly, the
disease-sensitive and QT-relevant biomarker identification task can
be formulated as the following objective:

$$
\min_{V} \; \sum_{i=1}^{n} \sum_{k=1}^{c_1} \left( y_{ik} \log \sum_{l=1}^{c_1} e^{\mathbf{w}_l^T \mathbf{x}_i} - y_{ik} \mathbf{w}_k^T \mathbf{x}_i \right)
+ \|X^T P - Z\|_F^2 + \gamma_1 \|V\|_{G_1} + \gamma_2 \|V\|_{2,1}, \tag{8}
$$

where $V = [W \; P] \in \mathbb{R}^{d \times (c_1 + c_2)}$. As a
result, the identified biomarkers will be correlated to memory scores
and also be discriminative to disease categories.
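The joint objective in Equation (8) can be evaluated directly once $W$ and $P$ are given. The sketch below is one possible NumPy transcription under stated assumptions: `Y` is stored as an $n \times c_1$ one-hot matrix, `group_sizes` lists the $d_j$, and all function names and the random toy data are hypothetical.

```python
import numpy as np

def group_l1_norm(V, group_sizes):
    """Sum of block norms over modalities (row blocks) and tasks (columns)."""
    bounds = np.cumsum([0] + list(group_sizes))
    return sum(np.linalg.norm(V[a:b], axis=0).sum()
               for a, b in zip(bounds[:-1], bounds[1:]))

def joint_objective(X, Y, Z, W, P, group_sizes, gamma1, gamma2):
    """Joint objective for X (d x n), one-hot Y (n x c1), Z (n x c2),
    W (d x c1) and P (d x c2)."""
    scores = X.T @ W                                # entry (i, k) is w_k^T x_i
    shift = scores.max(axis=1, keepdims=True)       # stabilises the exponentials
    log_sum = shift[:, 0] + np.log(np.exp(scores - shift).sum(axis=1))
    logistic = np.sum(Y.sum(axis=1) * log_sum) - np.sum(Y * scores)
    least_squares = np.sum((X.T @ P - Z) ** 2)      # squared Frobenius norm
    V = np.hstack([W, P])
    l21 = np.linalg.norm(V, axis=1).sum()
    return (logistic + least_squares
            + gamma1 * group_l1_norm(V, group_sizes) + gamma2 * l21)

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
d, n, c1, c2 = 6, 10, 3, 2
X = rng.standard_normal((d, n))
Y = np.eye(c1)[rng.integers(0, c1, size=n)]
Z = rng.standard_normal((n, c2))
value_at_zero = joint_objective(X, Y, Z, np.zeros((d, c1)), np.zeros((d, c2)),
                                [2, 4], gamma1=0.1, gamma2=0.1)
print(value_at_zero)
```

With all weights set to zero both regularizers vanish, the logistic term reduces to $n \log c_1$ and the regression term to $\|Z\|_F^2$, which gives a simple sanity check of the implementation.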

Since the objective in Equation (8) is a non-smooth problem
and cannot be easily solved in general, we derive a new efficient
algorithm to solve this problem in the next subsection.

2.3 Optimization algorithm 

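Taking the derivatives of Equation (8) with respect to $W$ and $P$,
respectively, and setting them to zero, we have

$$
\frac{\partial \mathcal{L}_1(W)}{\partial W}
+ 2\gamma_1 \left[ D_1 \mathbf{w}_1, \cdots, D_{c_1} \mathbf{w}_{c_1} \right]
+ 2\gamma_2 D W = 0, \tag{9}
$$

$$
2XX^T P - 2XZ
+ 2\gamma_1 \left[ D_{c_1+1} \mathbf{p}_1, \cdots, D_{c_1+c_2} \mathbf{p}_{c_2} \right]
+ 2\gamma_2 D P = 0, \tag{10}
$$

where $D_i$ ($1 \le i \le c_1 + c_2$) is a block diagonal matrix with
the $k$-th diagonal block as $\frac{1}{2\|\mathbf{v}_i^k\|_2} I_k$
($I_k$ is a $d_k$ by $d_k$ identity matrix), and $D$ is a diagonal
matrix with the $k$-th diagonal element as
$\frac{1}{2\|\mathbf{v}^k\|_2}$. Here $\mathbf{v}_i^k$ is the block of
the $i$-th column of $V$ that belongs to the $k$-th modality, and
$\mathbf{v}^k$ is the $k$-th row of $V$.

Since $D_i$ ($1 \le i \le c_1 + c_2$) and $D$ depend on $V = [W \; P]$,
they are also unknown variables to be optimized. In this article, we
provide an iterative algorithm to solve Equation (8). First, we guess
a random solution $V \in \mathbb{R}^{d \times (c_1 + c_2)}$, then we
calculate the matrices $D_i$ ($1 \le i \le c_1 + c_2$) and $D$
according to the current solution $V$. After obtaining $D_i$
($1 \le i \le c_1 + c_2$) and $D$, we can update the solution
$V = [W \; P]$ based on Equations (9) and (10). Specifically, the
$i$-th column of $P$ is updated by
$\mathbf{p}_i = (XX^T + \gamma_1 D_{c_1+i} + \gamma_2 D)^{-1} X \mathbf{z}_i$,
where $\mathbf{z}_i$ is the $i$-th column of $Z$. We cannot update
$W$ with a closed form solution based on Equation (9), but we can
obtain the updated $W$ by Newton's method. According to
Equation (9), we need to solve the following problem: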



$$
\min_{W} \; \mathcal{L}_1(W) + \gamma_1 \sum_{i=1}^{c_1} \mathbf{w}_i^T D_i \mathbf{w}_i + \gamma_2 \mathrm{Tr}(W^T D W). \tag{11}
$$

Similar to the traditional method in logistic regression
(Krishnapuram et al., 2001; Lee et al., 2006), we can use
Newton's method to obtain the solution $W$.

For the first term, the traditional logistic regression derivatives
can be applied to get the first- and second-order derivatives (Lee
et al., 2006).

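For the second term, the first- and second-order derivatives are

$$
\frac{\partial \sum_{i=1}^{c_1} \mathbf{w}_i^T D_i \mathbf{w}_i}{\partial W_{up}} = 2 D_p(u,u) W_{up},
\qquad
\frac{\partial^2 \sum_{i=1}^{c_1} \mathbf{w}_i^T D_i \mathbf{w}_i}{\partial W_{up} \, \partial W_{vq}} = 2 D_p(u,u) \delta_{uv} \delta_{pq}, \tag{12}
$$

where $D_p(u,u)$ is the $u$-th diagonal element of $D_p$.

For the third term, the first- and second-order derivatives are

$$
\frac{\partial \mathrm{Tr}(W^T D W)}{\partial W_{up}} = 2 D(u,u) W_{up},
\qquad
\frac{\partial^2 \mathrm{Tr}(W^T D W)}{\partial W_{up} \, \partial W_{vq}} = 2 D(u,u) \delta_{uv} \delta_{pq}. \tag{13}
$$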



After obtaining the updated solution $V = [W \; P]$, we can calculate
the new matrices $D_i$ ($1 \le i \le c_1 + c_2$) and $D$. This
procedure is repeated until the algorithm converges. The detailed
algorithm is listed in Algorithm 1. We will prove that the above
algorithm will converge to the global optimum.
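The reweighting idea is easiest to see on the regression part alone, where every update has a closed form. The following is a simplified sketch, not the full algorithm: it drops the logistic term (so no Newton step is needed), and it adds a small smoothing constant `eps` inside the square roots to avoid division by zero, which is an assumption of this example rather than part of the derivation. The function name, the random initialization and the synthetic data are likewise hypothetical.

```python
import numpy as np

def reweighted_regression(X, Z, group_sizes, gamma1, gamma2,
                          n_iter=30, eps=1e-8, seed=0):
    """Iteratively reweighted solver for
    ||X^T P - Z||_F^2 + gamma1 * ||P||_G1 + gamma2 * ||P||_{2,1}
    (regression-only simplification, with eps-smoothed norms)."""
    d, c2 = X.shape[0], Z.shape[1]
    bounds = np.cumsum([0] + list(group_sizes))
    # modality index of every feature (row)
    group_of_row = np.repeat(np.arange(len(group_sizes)), group_sizes)
    XXt, XZ = X @ X.T, X @ Z
    P = np.random.default_rng(seed).standard_normal((d, c2))  # random guess
    history = []
    for _ in range(n_iter):
        # squared block norms: one row per modality, one column per task
        block_sq = np.stack([(P[a:b] ** 2).sum(axis=0)
                             for a, b in zip(bounds[:-1], bounds[1:])])
        row_sq = (P ** 2).sum(axis=1)
        # smoothed objective at the current iterate
        history.append(np.sum((X.T @ P - Z) ** 2)
                       + gamma1 * np.sqrt(block_sq + eps).sum()
                       + gamma2 * np.sqrt(row_sq + eps).sum())
        D = 0.5 / np.sqrt(row_sq + eps)            # diagonal of D
        D_blocks = 0.5 / np.sqrt(block_sq + eps)   # block weights of each D_i
        for i in range(c2):
            D_i = D_blocks[group_of_row, i]        # diagonal of D_i
            A = XXt + np.diag(gamma1 * D_i + gamma2 * D)
            P[:, i] = np.linalg.solve(A, XZ[:, i]) # closed-form column update
    return P, history

# Synthetic example: 8 features in two modalities, 40 samples, 2 scores.
rng = np.random.default_rng(1)
X = rng.standard_normal((8, 40))
Z = rng.standard_normal((40, 2))
P_hat, history = reweighted_regression(X, Z, [3, 5], gamma1=1.0, gamma2=1.0)
print(history[0], history[-1])
```

Each pass recomputes the reweighting matrices from the current solution and then solves one linear system per task, so the recorded objective values should form a non-increasing sequence, in line with the convergence argument of the next subsection.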

2.4 Algorithm analysis 

To prove the convergence of the proposed algorithm, we need a 
lemma as follows. 

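Lemma 1. For any vectors $\mathbf{v}$ and $\mathbf{v}_0$ with
$\mathbf{v}_0 \neq 0$, we have the following inequality:

$$
\|\mathbf{v}\|_2 - \frac{\|\mathbf{v}\|_2^2}{2\|\mathbf{v}_0\|_2}
\le \|\mathbf{v}_0\|_2 - \frac{\|\mathbf{v}_0\|_2^2}{2\|\mathbf{v}_0\|_2}.
$$

Proof. Obviously, $-(\|\mathbf{v}\|_2 - \|\mathbf{v}_0\|_2)^2 \le 0$,
so we have

$$
\begin{aligned}
-(\|\mathbf{v}\|_2 - \|\mathbf{v}_0\|_2)^2 \le 0
&\Rightarrow 2\|\mathbf{v}\|_2 \|\mathbf{v}_0\|_2 - \|\mathbf{v}\|_2^2 \le \|\mathbf{v}_0\|_2^2 \\
&\Rightarrow \|\mathbf{v}\|_2 - \frac{\|\mathbf{v}\|_2^2}{2\|\mathbf{v}_0\|_2}
\le \|\mathbf{v}_0\|_2 - \frac{\|\mathbf{v}_0\|_2^2}{2\|\mathbf{v}_0\|_2},
\end{aligned} \tag{14}
$$

which completes the proof. □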
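As a quick numerical sanity check of Lemma 1, the sketch below draws random vector pairs and verifies that the right-hand side never falls below the left-hand side. The helper name `lemma_gap` and the sampling scheme are illustrative assumptions.

```python
import numpy as np

def lemma_gap(v, v0):
    """Right-hand side minus left-hand side of Lemma 1.
    Algebraically this equals (||v|| - ||v0||)^2 / (2 ||v0||) >= 0."""
    nv, nv0 = np.linalg.norm(v), np.linalg.norm(v0)
    lhs = nv - nv ** 2 / (2 * nv0)
    rhs = nv0 - nv0 ** 2 / (2 * nv0)
    return rhs - lhs

rng = np.random.default_rng(0)
gaps = [lemma_gap(rng.standard_normal(5), rng.standard_normal(5))
        for _ in range(1000)]
# A tiny negative tolerance absorbs floating-point rounding.
print(min(gaps) >= -1e-12)
```

The gap is zero exactly when the two vectors have the same norm, which is the case reached at a fixed point of the iteration.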



Then we prove the convergence of the algorithm, which is 
described in the following theorem. 

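Theorem 1. The algorithm decreases the objective value of problem
(8) in each iteration.

Proof. In each iteration, suppose the updated $W$ is $\tilde{W}$, and
the updated $P$ is $\tilde{P}$, then the updated $V$ is
$\tilde{V} = [\tilde{W} \; \tilde{P}]$. From Step 3 in
Algorithm 1, we know that:

$$
\mathcal{L}_1(\tilde{W}) + \gamma_1 \sum_{i=1}^{c_1} \tilde{\mathbf{w}}_i^T D_i \tilde{\mathbf{w}}_i + \gamma_2 \mathrm{Tr}(\tilde{W}^T D \tilde{W})
\le \mathcal{L}_1(W) + \gamma_1 \sum_{i=1}^{c_1} \mathbf{w}_i^T D_i \mathbf{w}_i + \gamma_2 \mathrm{Tr}(W^T D W). \tag{15}
$$

According to Step 4, we have:

$$
\|X^T \tilde{P} - Z\|_F^2 + \gamma_1 \sum_{i=1}^{c_2} \tilde{\mathbf{p}}_i^T D_{c_1+i} \tilde{\mathbf{p}}_i + \gamma_2 \mathrm{Tr}(\tilde{P}^T D \tilde{P})
\le \|X^T P - Z\|_F^2 + \gamma_1 \sum_{i=1}^{c_2} \mathbf{p}_i^T D_{c_1+i} \mathbf{p}_i + \gamma_2 \mathrm{Tr}(P^T D P). \tag{16}
$$

Based on the definitions of $D_i$ ($1 \le i \le c_1 + c_2$) and $D$,
and Lemma 1, we have the following two inequalities (here $K$ denotes
the number of modalities):

$$
\begin{aligned}
&\sum_{i=1}^{c_1+c_2} \sum_{k=1}^{K} \left( \|\tilde{\mathbf{v}}_i^k\|_2 - \frac{\|\tilde{\mathbf{v}}_i^k\|_2^2}{2\|\mathbf{v}_i^k\|_2} \right)
\le \sum_{i=1}^{c_1+c_2} \sum_{k=1}^{K} \left( \|\mathbf{v}_i^k\|_2 - \frac{\|\mathbf{v}_i^k\|_2^2}{2\|\mathbf{v}_i^k\|_2} \right) \\
\Rightarrow\; &\sum_{i=1}^{c_1+c_2} \sum_{k=1}^{K} \|\tilde{\mathbf{v}}_i^k\|_2 - \sum_{i=1}^{c_1+c_2} \tilde{\mathbf{v}}_i^T D_i \tilde{\mathbf{v}}_i
\le \sum_{i=1}^{c_1+c_2} \sum_{k=1}^{K} \|\mathbf{v}_i^k\|_2 - \sum_{i=1}^{c_1+c_2} \mathbf{v}_i^T D_i \mathbf{v}_i
\end{aligned} \tag{17}
$$

and

$$
\begin{aligned}
&\sum_{k=1}^{d} \left( \|\tilde{\mathbf{v}}^k\|_2 - \frac{\|\tilde{\mathbf{v}}^k\|_2^2}{2\|\mathbf{v}^k\|_2} \right)
\le \sum_{k=1}^{d} \left( \|\mathbf{v}^k\|_2 - \frac{\|\mathbf{v}^k\|_2^2}{2\|\mathbf{v}^k\|_2} \right) \\
\Rightarrow\; &\sum_{k=1}^{d} \|\tilde{\mathbf{v}}^k\|_2 - \mathrm{Tr}(\tilde{V}^T D \tilde{V})
\le \sum_{k=1}^{d} \|\mathbf{v}^k\|_2 - \mathrm{Tr}(V^T D V).
\end{aligned} \tag{18}
$$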

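Algorithm 1 An efficient iterative algorithm to solve the optimization
problem in Equation (8).

Input: $X = [\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$,
$Y = [(\mathbf{y}^1)^T, (\mathbf{y}^2)^T, \cdots, (\mathbf{y}^n)^T]^T \in \{0,1\}^{n \times c_1}$
and $Z = [(\mathbf{z}^1)^T, (\mathbf{z}^2)^T, \cdots, (\mathbf{z}^n)^T]^T \in \mathbb{R}^{n \times c_2}$.

Output: $W \in \mathbb{R}^{d \times c_1}$ and $P \in \mathbb{R}^{d \times c_2}$.

1. Initialize $W \in \mathbb{R}^{d \times c_1}$, $P \in \mathbb{R}^{d \times c_2}$.
   Let $V = [W \; P] \in \mathbb{R}^{d \times (c_1 + c_2)}$.
repeat
2. Calculate the block diagonal matrices $D_i$ ($1 \le i \le c_1 + c_2$),
   where the $k$-th diagonal block of $D_i$ is
   $\frac{1}{2\|\mathbf{v}_i^k\|_2} I_k$. Calculate the diagonal matrix
   $D$, where the $k$-th diagonal element is
   $\frac{1}{2\|\mathbf{v}^k\|_2}$.
3. Update $\mathbf{w}$ by $\mathbf{w} - B^{-1} \mathbf{a}$, where the
   $(d(p-1)+u)$-th element ($1 \le u \le d$, $1 \le p \le c_1$) of
   $\mathbf{a} \in \mathbb{R}^{d c_1 \times 1}$ is
   $\frac{\partial \left( \mathcal{L}_1(W) + \gamma_1 \sum_{i=1}^{c_1} \mathbf{w}_i^T D_i \mathbf{w}_i + \gamma_2 \mathrm{Tr}(W^T D W) \right)}{\partial W_{up}}$,
   and the $(d(p-1)+u, \, d(q-1)+v)$-th element
   ($1 \le u, v \le d$, $1 \le p, q \le c_1$) of
   $B \in \mathbb{R}^{d c_1 \times d c_1}$ is
   $\frac{\partial^2 \left( \mathcal{L}_1(W) + \gamma_1 \sum_{i=1}^{c_1} \mathbf{w}_i^T D_i \mathbf{w}_i + \gamma_2 \mathrm{Tr}(W^T D W) \right)}{\partial W_{up} \, \partial W_{vq}}$.
   Construct the updated $W \in \mathbb{R}^{d \times c_1}$ from the
   updated vector $\mathbf{w} \in \mathbb{R}^{d c_1}$, where the
   $(u,p)$-th element of $W$ is the $(d(p-1)+u)$-th element of
   $\mathbf{w}$.
4. Update the $i$-th column of $P$ by
   $\mathbf{p}_i = (XX^T + \gamma_1 D_{c_1+i} + \gamma_2 D)^{-1} X \mathbf{z}_i$.
5. Update $V$ by $V = [W \; P]$.
until convergence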



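Note the following two equalities:

$$
\sum_{i=1}^{c_1+c_2} \mathbf{v}_i^T D_i \mathbf{v}_i = \sum_{i=1}^{c_1} \mathbf{w}_i^T D_i \mathbf{w}_i + \sum_{i=1}^{c_2} \mathbf{p}_i^T D_{c_1+i} \mathbf{p}_i,
\qquad
\mathrm{Tr}(V^T D V) = \mathrm{Tr}(W^T D W) + \mathrm{Tr}(P^T D P). \tag{19}
$$

Then, by adding Equations (15) and (16) together with $\gamma_1$
times Equation (17) and $\gamma_2$ times Equation (18) on both sides,
and using Equation (19), we arrive at

$$
\mathcal{L}_1(\tilde{W}) + \mathcal{L}_2(\tilde{P}) + \gamma_1 \sum_{i=1}^{c_1+c_2} \sum_{k=1}^{K} \|\tilde{\mathbf{v}}_i^k\|_2 + \gamma_2 \sum_{k=1}^{d} \|\tilde{\mathbf{v}}^k\|_2
\le \mathcal{L}_1(W) + \mathcal{L}_2(P) + \gamma_1 \sum_{i=1}^{c_1+c_2} \sum_{k=1}^{K} \|\mathbf{v}_i^k\|_2 + \gamma_2 \sum_{k=1}^{d} \|\mathbf{v}^k\|_2,
$$

where $\mathcal{L}_2(P) = \|X^T P - Z\|_F^2$. Therefore, the algorithm
decreases the objective value of problem (8) in each iteration. □

At convergence, $W$, $P$, $D_i$ ($1 \le i \le c_1 + c_2$) and $D$
satisfy Equations (9) and (10). As Equation (8) is a convex problem,
satisfying these equations indicates that $V = [W \; P]$ is a global
optimum solution to Equation (8). Therefore, Algorithm 1 will
converge to the global optimum of Equation (8). Since our algorithm
has a closed form solution in each iteration, the convergence is
very fast.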



3 EMPIRICAL STUDIES AND DISCUSSIONS 

Data used in the preparation of this article were obtained from
the Alzheimer's Disease Neuroimaging Initiative (ADNI) database
(adni.loni.ucla.edu). One goal of ADNI has been to
test whether serial magnetic resonance imaging (MRI), positron
emission tomography (PET), other biological markers, and clinical
and neuropsychological assessment can be combined to measure
the progression of MCI and early AD. For up-to-date information,
see www.adni-info.org. Following a prior imaging genetics
study (Shen et al., 2010b), 733 non-Hispanic Caucasian participants
were included in this study. We empirically evaluate the proposed
method by applying it to the ADNI cohort, where a wide range
of multimodal biomarkers are examined and selected to predict
memory performance measured by five RAVLT scores and classify
participants into HC, MCI and AD.

3.1 Experimental design 

Overall setting: our primary goal is to identify relevant genetic
and imaging biomarkers that can classify disease status and predict
memory scores (Fig. 2). We describe our genotyping, imaging and
memory data in Section 3.1, present the identified biomarkers in
Section 3.2, discuss the disease classification in Section 3.3 and
demonstrate the memory score prediction in Section 3.4.

Genotyping data: the single-nucleotide polymorphism (SNP) data
(Saykin et al., 2010) were genotyped using the Human 610-Quad
BeadChip (Illumina, Inc., San Diego, CA, USA). Among all SNPs,
only those belonging to the top 40 AD candidate genes listed
on the AlzGene database (www.alzgene.org) as of June 10, 2010,
were selected after the standard quality control (QC) and imputation
steps. The QC criteria for the SNP data include (i) call rate check
per subject and per SNP marker, (ii) gender check, (iii) sibling
pair identification, (iv) the Hardy-Weinberg equilibrium test, (v)
marker removal by the minor allele frequency and (vi) population
stratification. The quality-controlled SNPs were then imputed using
the MaCH software to estimate the missing genotypes. After that,
the Illumina annotation information based on the Genome build 36.2
was used to select a subset of SNPs belonging or proximal to the top
40 AD candidate genes. This procedure yielded 1224 SNPs, which
were annotated with 37 genes (Wang et al., 2012). For the remaining
3 genes, no SNPs were available on the genotyping chip.

Imaging biomarkers: in this study, we use the baseline structural
MRI and molecular FDG-PET scans, from which we extract
imaging biomarkers. Two widely employed automated MRI analysis
techniques were used to process and extract imaging phenotypes
across the brain from all baseline scans of ADNI participants
as previously described (Shen et al., 2010b). First, voxel-based
morphometry (VBM) (Ashburner and Friston, 2000) was performed
to define global gray matter (GM) density maps and extract local GM
density values for 86 target regions (Fig. 4a). Second, automated
parcellation via FreeSurfer V4 (Fischl et al., 2002) was conducted to
define 56 volumetric and cortical thickness values (Fig. 4b) and to
extract total intracranial volume (ICV). Further information about
these measures is available in Shen et al. (2010b). All these measures
were adjusted for the baseline age, gender, education, handedness
and baseline ICV using the regression weights derived from the
healthy control participants. For PET images, following Landau
et al. (2009), mean glucose metabolism (CMglu) measures of 26
regions of interest (ROIs) in the Montreal Neurological Institute
(MNI) brain atlas space were employed in this study (Fig. 4c).
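The covariate adjustment described above can be sketched as follows. This is one plausible reading, stated as an assumption: an ordinary least squares model is fitted on the healthy controls only, and the estimated covariate effects are then subtracted from every participant. The function name, the array layout and the synthetic data are hypothetical, and the exact procedure used in the study may differ.

```python
import numpy as np

def adjust_for_covariates(measures, covariates, is_control):
    """Remove covariate effects estimated on healthy controls.

    measures:   (n, m) imaging measures
    covariates: (n, q) e.g. age, gender, education, handedness, ICV
    is_control: (n,) boolean mask selecting the healthy controls
    """
    design = np.column_stack([np.ones(len(covariates)), covariates])
    # regression weights derived from the healthy controls only
    beta, *_ = np.linalg.lstsq(design[is_control], measures[is_control],
                               rcond=None)
    # subtract the covariate effects but keep the control-group intercept
    return measures - design[:, 1:] @ beta[1:]

# Synthetic check: measures that are fully explained by the covariates
# collapse to the intercept after adjustment.
rng = np.random.default_rng(0)
n = 50
covs = rng.standard_normal((n, 3))
effects = rng.standard_normal((3, 2))
raw = 2.0 + covs @ effects
controls = np.arange(n) < 20
adjusted = adjust_for_covariates(raw, covs, controls)
print(np.allclose(adjusted, 2.0))
```

Estimating the weights on controls only keeps disease-related variation out of the covariate model, so that it remains in the adjusted measures.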

Memory data: The cognitive measures we use to test the proposed
method are the baseline RAVLT memory scores from all ADNI
participants. The standard RAVLT format starts with a list of 15
unrelated words (List A) repeated over five different trials, and
participants are asked to repeat them. Then the examiner presents a
second list of 15 words (List B), and the participant is asked to
remember as many words as possible from List A. Trial 6, termed as
5 min recall, requests the participant again to recall as many words
as possible from List A, without reading it again. Trial 7, termed as
30 min recall, is administered in the same way as Trial 6, but after
a 30 min delay. Finally, a recognition test is given in which 30 words
are read aloud and the participant is asked to indicate whether or
not each word is on List A. The RAVLT has proven useful in evaluating
verbal learning and memory. Table 1 summarizes the five RAVLT scores
used in our experiments.

Fig. 4. Weight maps for multimodal data: (a) VBM measures from MRI,
(b) FreeSurfer measures from MRI, (c) glucose metabolism from FDG-PET,
and (d) top SNP findings. Weights for disease classification were labeled
as Diag-L (left side), Diag-R (right side) or Diag; and weights for RAVLT
regression were labeled as AVLT-L, AVLT-R or AVLT. In (a-c), weights
were normalized by dividing by the corresponding threshold used for feature
selection, and thus all selected features had normalized weights ≥ 1 and were
marked with 'x'. In (d), only top SNPs were shown, weights were normalized
by dividing by the weight of the 10th top SNP, and the top 10 SNPs for either
classification or regression task had normalized weights ≥ 1 and were marked
with 'x'

Table 1. RAVLT cognitive measures as responses in multitask learning

Task ID   Description of RAVLT scores
-------   -----------------------------------------------
TOTAL     Total score of the first 5 learning trials
TOT6      Trial 6 total number of words recalled
TOTB      List B total number of words recalled
T30       30 minute delay total number of words recalled
RECOG     30 minute delay recognition score

Table 2. Multimodal feature sets as predictors in multiview learning

View ID (feature set ID)   Modality   No. of features
------------------------   --------   ---------------
VBM                        MRI        86
FreeSurfer                 MRI        56
FDG-PET                    FDG-PET    26
SNP                        Genetics   1244



Participant selection: In this study, we included only participants 
with no missing data for all above four types (views) of features and 
cognitive scores, which resulted in a set of 345 subjects (83 HC, 
174 MCI and 88 AD). The feature sets extracted from baseline 
multimodal data of these subjects are summarized in Table 2.

3.2 Biomarker identifications 

The proposed heterogeneous multitask learning scheme aims to
identify genetic and phenotypic biomarkers that are associated
with both cognition (e.g. RAVLT in this study) and disease
status in a joint regression and classification framework. Here we
first examine the identified biomarkers. Figure 4 shows a
summary of the selected features for all four data types, where the
regression/classification weights are color-mapped for each feature
and each task.

In Figure 4a, many VBM measures are selected as associated
with disease status, which is in accordance with the known global brain
atrophy pattern in AD. The VBM measures associated with RAVLT
scores seem to be a subset of those disease-sensitive markers,
showing a specific memory circuitry contributing to the disease,
as well as suggesting that the disease involves not only this
memory function but also other, more complicated factors. These
results suggest that the proposed method has the potential to offer
deeper mechanistic understanding. Shown in Figure 5 is a comparison
between RAVLT-relevant markers and AD-relevant markers and their
associated weights mapped onto a standard brain space.

Figure 4b shows the identified markers from the FreeSurfer data.
In this case, a small set of markers is discovered. These markers,
such as hippocampal volume, amygdala volume and entorhinal
cortex thickness, are all well-known AD-relevant markers, showing
the effectiveness of the proposed method. These markers are also
shown to be associated with both AD and RAVLT. The FDG-PET
findings (Fig. 4c) are also interesting and promising. The
AD-relevant biomarkers include angular, hippocampus, middle temporal
and post cingulate regions, which agrees with prior findings, e.g.
Landau et al. (2009). Again, a subset of these markers is also
relevant to RAVLT scores.

As to the genetics, only top findings are shown in Figure 4d. The
APOE E4 SNP (rs429358), the best known AD risk factor, shows the
strongest link to both disease status and RAVLT scores. A few other
important AD genes, including the recently discovered and replicated
PICALM and BIN1, are also included in the results. For the newly
identified SNPs, further investigation in independent cohorts is
warranted.

3.3 Improved disease classification 

We classify the selected participants of the ADNI cohort using the
proposed methods by integrating the four different types of data.



Fig. 5. VBM weights of joint regression of AVLT scores and classification
of disease status were mapped onto brain. (a) Overall weights for disease
classification; (b) Overall weights for AVLT regression.

We report the classification performances of our method. We
compare our methods against several recent MKL methods that
are able to make use of multiple types of data, including the SVM
$\ell_\infty$ MKL method (Sonnenburg et al., 2006), the SVM $\ell_1$
MKL method (Lanckriet et al., 2004), the SVM $\ell_2$ MKL method
(Kloft et al., 2008), the least square SVM (LSSVM) $\ell_\infty$ MKL
method (Ye et al., 2008), the LSSVM $\ell_1$ MKL method (Suykens
et al., 2002) and the LSSVM $\ell_2$ MKL method (Yu et al., 2010).
We also compare a related method, the Heterogeneous Multitask
Learning (HML) method (Yang et al., 2009), which simultaneously
conducts classification and regression like our method. However,
because this method is designed for homogeneous input data and is
not able to deal with multiple types of data at the same time, we
concatenate the four types of features as its input. In addition, we
report the classification performances of our method and SVM on each
individual type of data as baselines. The performance of SVM on a
simple concatenation of all four types of features is also reported.
In our experiments, we conduct three-class classification, which is
more desirable and more challenging than binary classifications
using each pair of the three categories.
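
As a minimal sketch of the simple concatenation used for the single-input baselines, the snippet below joins the four views of one subject into a single feature vector. The view sizes are those listed in Table 2; the function name, the view order and the zero-valued placeholder subject are illustrative assumptions only.

```python
# Number of features per view, as listed in Table 2.
VIEW_DIMS = {"VBM": 86, "FreeSurfer": 56, "FDG-PET": 26, "SNP": 1244}


def concatenate_views(subject_views, order=tuple(VIEW_DIMS)):
    """Join the per-view feature lists of one subject into a single row."""
    row = []
    for name in order:
        features = subject_views[name]
        if len(features) != VIEW_DIMS[name]:
            raise ValueError(f"view {name!r} has {len(features)} features, "
                             f"expected {VIEW_DIMS[name]}")
        row.extend(features)
    return row


# Placeholder subject with all-zero features (illustration only).
subject = {name: [0.0] * dim for name, dim in VIEW_DIMS.items()}
row = concatenate_views(subject)
print(len(row))  # 1412 = 86 + 56 + 26 + 1244
```

After concatenation the view boundaries are no longer visible to the learner, so a method applied to this input cannot weight the four data types separately.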

We conduct standard 5-fold cross-validation and report the
average results. For each of the five trials, within the training
data, an internal 5-fold cross-validation is performed to fine
tune the parameters. The parameters of our methods ($\gamma_1$
and $\gamma_2$ in Equation (8)) are optimized in the range of
$\{10^{-5}, 10^{-4}, \ldots, 10^{4}, 10^{5}\}$. For the SVM and MKL
methods, one Gaussian kernel is constructed for each type of features,
i.e. $\mathcal{K}(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, where the
parameters $\gamma$ are fine tuned in the same range as used for our
method. We implement the MKL methods using the codes published by
Yu et al. (2010). Following Yu et al. (2010), in the LSSVM
$\ell_\infty$ and $\ell_2$ methods, the regularization parameter
$\lambda$ is estimated jointly as the kernel coefficient of an
identity matrix; in the LSSVM $\ell_1$ method, $\lambda$ is set to 1;
in all other SVM approaches, the C parameter of the box constraint is
set to 1. We use the LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
software package to implement SVM. We implement the HML method
following the details in its original work, and set its parameters to
be optimal. The classification performance of all compared methods in
AD detection, measured by classification accuracy, is reported in
Table 3.
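
The evaluation protocol above (a Gaussian kernel per feature type, a logarithmic parameter grid, and 5-fold cross-validation with an internal 5-fold loop for tuning) can be sketched in plain Python as follows. This is a simplified illustration, not the code used in the experiments: `fit_and_score` is a hypothetical callback standing in for training and evaluating any of the compared models, a single parameter is tuned rather than a pair, and the random fold assignment is an assumption.

```python
import math
import random

# Grid {1e-5, 1e-4, ..., 1e4, 1e5} used for parameter tuning.
PARAM_GRID = [10.0 ** k for k in range(-5, 6)]


def gaussian_kernel(x_i, x_j, gamma):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)


def k_fold_indices(indices, k=5, seed=0):
    """Split indices into k (train, test) pairs after a seeded shuffle."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    pairs = []
    for j in range(k):
        train = sorted(i for f in folds[:j] + folds[j + 1:] for i in f)
        pairs.append((train, sorted(folds[j])))
    return pairs


def nested_cv(n_samples, fit_and_score, k=5):
    """Average outer-fold score, tuning one parameter on inner folds only.

    fit_and_score(train_idx, test_idx, param) -> score is a hypothetical
    callback supplied by the caller.
    """
    outer_scores = []
    for train_idx, test_idx in k_fold_indices(range(n_samples), k):
        def inner_mean(param):
            inner = k_fold_indices(train_idx, k, seed=1)
            return sum(fit_and_score(tr, va, param) for tr, va in inner) / k

        best = max(PARAM_GRID, key=inner_mean)
        outer_scores.append(fit_and_score(train_idx, test_idx, best))
    return sum(outer_scores) / k


folds = k_fold_indices(range(345))
print([len(test) for _, test in folds])  # [69, 69, 69, 69, 69]
```

Because the parameter is chosen using the training portion of each outer fold only, the held-out subjects never influence tuning.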

A first glance at the results shows that our methods consistently
outperform all other compared methods, which demonstrates the
effectiveness of our methods in early AD detection. In addition,
the methods using multiple data sources are generally better than
their counterparts using one single type of data. This confirms
the usefulness of data integration in AD diagnosis. Moreover, our
methods always outperform the MKL methods in these experiments,
although both take advantage of multiple data sources. This
observation is consistent with our theoretical analysis. That is, our
methods not only assign proper weight to each type of data, but also
consider the relevance of the features inside each individual type of
data. In contrast, the MKL methods address the former while not
taking into account the latter.

Table 3. Classification performance comparison between the proposed
method and related methods for distinguishing HC, MCI and AD

Methods                             Accuracy (mean ± SD)
SVM (SNP)                           0.561 ± 0.026
SVM (FreeSurfer)                    0.573 ± 0.012
SVM (VBM)                           0.541 ± 0.032
SVM (PET)                           0.535 ± 0.026
SVM (all)                           U.j /j ± U.Uiy
HML (all)                           U.oJo ± u.uiy
SVM $\ell_\infty$ MKL method        0.624 ± 0.031
SVM $\ell_1$ MKL method             0.593 ± 0.042
SVM $\ell_2$ MKL method             0.561 ± 0.037
LSSVM $\ell_\infty$ MKL method      0.614 ± 0.031
LSSVM $\ell_1$ MKL method           0.585 ± 0.018
LSSVM $\ell_2$ MKL method           0.577 ± 0.033
Our method (SNP)                    0.673 ± 0.021
Our method (FreeSurfer)             0.689 ± 0.029
Our method (VBM)                    0.669 ± 0.031
Our method (PET)                    0.621 ± 0.028
Our method                          0.726 ± 0.032

3.4 Improved memory performance prediction 

Now we evaluate the memory performance prediction capability of
the proposed method. Since the cognitive scores are continuous,
we evaluate the proposed method via regression and compare it
to two baseline methods, i.e. multivariate linear regression (MRV)
and ridge regression. Since both MRV and ridge regression are for
single-type input data, we conduct regression on each of the four
types of features and on a simple concatenation of them. Similarly, we
also predict memory performance by our method under the same test
conditions. When multiple-type input data are used, as demonstrated
in Section 3.2, our method automatically and adaptively selects the
prominent biomarkers for regression. For each test case, we conduct
standard 5-fold cross-validation and report the average results. For
each of the five trials, within the training data, an internal 5-fold
cross-validation is performed to fine tune the parameters in the range
of $\{10^{-5}, 10^{-4}, \ldots, 10^{4}, 10^{5}\}$ for both ridge
regression and our method. For our method, in each trial, from the
learned coefficient matrix we sum the absolute values of the
coefficients of a single feature over all the tasks as the overall
weight, from which we pick the features with non-zero weights
(i.e. $w > 10^{-3}$) to predict regression responses for test data.
The performance, assessed by root mean square error (RMSE), a widely
used measurement for statistical regression analysis, is reported in
Table 4.
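
The feature-weighting rule and the error measure described above can be illustrated with a short sketch. The coefficient matrix `W` below is a made-up toy example (rows are features, columns are tasks), and the function names are hypothetical; only the rule itself (sum of absolute coefficients over tasks, threshold $10^{-3}$) and the RMSE definition come from the text.

```python
import math


def overall_feature_weights(W):
    """Overall weight of feature i: sum over tasks t of |W[i][t]|."""
    return [sum(abs(c) for c in row) for row in W]


def select_features(W, threshold=1e-3):
    """Indices of features whose overall weight exceeds the threshold."""
    weights = overall_feature_weights(W)
    return [i for i, w in enumerate(weights) if w > threshold]


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted scores."""
    squared = [(a - b) ** 2 for a, b in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Toy coefficient matrix: 4 features x 2 tasks (illustrative values only).
W = [
    [0.5, -0.2],
    [0.0, 0.0004],
    [-0.0003, 0.0002],
    [0.0, 0.3],
]
print(select_features(W))               # [0, 3]
print(rmse([3.0, 4.0], [0.0, 0.0]))     # sqrt(12.5), about 3.536
```

Features 1 and 2 are dropped because their summed absolute coefficients (0.0004 and 0.0005) fall below the threshold, even though they are not exactly zero.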



Table 4. Comparison of memory prediction performance measured by
average RMSEs (smaller is better)

Test case                  TOTAL    TOT6     TOTB     T30      RECOG
MRV (SNP)                  6.153    2.476    2.168    2.201    3.483
MRV (FreeSurfer)           5.928    2.235    2.039    2.088    3.339
MRV (VBM)                  6.093    2.289    2.142    2.137    3.394
MRV (PET)                  6.246    2.514    2.237    2.215    3.615
MRV (all)                  5.909    2.232    1.992    2.032    3.306
Ridge (SNP)                6.076    2.416    2.147    2.117    3.368
Ridge (FreeSurfer)         5.757    2.203    2.004    2.017    3.237
Ridge (VBM)                5.976    2.147    2.038    2.129    3.249
Ridge (PET)                6.153    2.443    2.186    2.107    3.515
Ridge (all)                5.704    2.143    1.989    1.994    3.193
Our method (SNP)           5.991    2.201    2.008    2.001    3.107
Our method (FreeSurfer)    5.601    2.106    1.947    1.886    3.015
Our method (VBM)           5.715    2.011    1.899    1.974    3.041
Our method (PET)           6.013    2.241    2.017    2.017    3.331
Our method (all)           5.506    1.984    1.886    1.841    2.989

From Table 4, we can see that the proposed method always has
better memory prediction performance. Among the test cases, the
FreeSurfer imaging measures and VBM imaging measures have
similar predictive power, which is better than that of the PET imaging
measures and SNP features. In general, combining the four types of
features is better than using only one type of data. Since our method
adaptively weights each type of data and each feature inside a type
of data, it has the least regression error when using all available
input data. These results, again, demonstrate the usefulness of our
method and data integration in early AD diagnosis.

4 CONCLUSIONS 

We proposed a novel sparse multimodal multitask learning
method to identify disease-sensitive biomarkers by integrating
heterogeneous imaging genetics data. We utilized a joint
classification and regression learning model to identify the
disease-sensitive and QT-relevant biomarkers. We introduced a
novel combined structured sparsity regularization to integrate
heterogeneous imaging genetics data, derived a new efficient
optimization algorithm to solve our non-smooth objective function,
and provided a rigorous theoretical analysis of its convergence to
the global optimum. The empirical results showed that our method
improved both memory score prediction and disease classification
accuracy.